Tweets Language Identification using Feature Weighting

Authors

  • Juglar Díaz Zamora
  • Adrian Fonseca Bruzón
  • Reynier Ortega Bueno

Abstract

This paper describes the language identification method presented at the Twitter Language Identification Workshop (TweetLID-2014). The proposed method represents tweets as weighted character-level trigrams. We apply three weighting schemes from Text Categorization to obtain a numerical value that represents the relation between each trigram and each language. For each language, we add up the importance of the tweet's trigrams. The tweet's language is then determined by simple majority voting. Finally, we analyze the results.
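The pipeline outlined in the abstract can be illustrated with a minimal Python sketch. The abstract names neither the three weighting schemes nor what exactly casts a vote, so everything specific here is an assumption: the three weight functions (`relative_frequency`, `class_share`, `tf_icf`) are illustrative stand-ins rather than the schemes used in the paper, and the sketch assumes each scheme votes for its top-scoring language. All function names are hypothetical.

```python
from collections import Counter, defaultdict
import math


def trigrams(text):
    """Return the character-level trigrams of a string, padded with spaces."""
    padded = f" {text.lower()} "
    return [padded[i:i + 3] for i in range(len(padded) - 2)]


def train(samples):
    """Count trigram occurrences per language from (text, language) pairs."""
    counts = defaultdict(Counter)
    for text, lang in samples:
        counts[lang].update(trigrams(text))
    return counts


# Three illustrative weighting schemes (assumptions, not the paper's own).
def relative_frequency(gram, lang, counts):
    """Frequency of the trigram among all trigrams seen for the language."""
    total = sum(counts[lang].values())
    return counts[lang][gram] / total if total else 0.0


def class_share(gram, lang, counts):
    """Share of the trigram's occurrences that fall in this language."""
    total = sum(c[gram] for c in counts.values())
    return counts[lang][gram] / total if total else 0.0


def tf_icf(gram, lang, counts):
    """Trigram count scaled by an inverse class frequency factor."""
    n_with = sum(1 for c in counts.values() if c[gram] > 0)
    if n_with == 0:
        return 0.0
    return counts[lang][gram] * math.log(1 + len(counts) / n_with)


SCHEMES = (relative_frequency, class_share, tf_icf)


def score(text, counts, weight):
    """For each language, add up the weights of the tweet's trigrams."""
    grams = trigrams(text)
    return {lang: sum(weight(g, lang, counts) for g in grams) for lang in counts}


def identify(text, counts):
    """Let each scheme vote for its best language; return the majority winner.

    Ties are broken by whichever language received its first vote earliest.
    """
    votes = Counter()
    for weight in SCHEMES:
        scores = score(text, counts, weight)
        votes[max(scores, key=scores.get)] += 1
    return votes.most_common(1)[0][0]
```

As a usage example, `counts = train([("the weather is nice today", "en"), ("el tiempo es muy bueno hoy", "es")])` followed by `identify("the weather", counts)` scores the tweet under each scheme and returns the language with the most votes. A real system would need far more training text per language than this toy input.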

Similar resources

Term Weighting in Short Documents for Document Categorization, Keyword Extraction and Query Expansion

This thesis focuses on term weighting in short documents. I propose weighting approaches for assessing the importance of terms for three tasks: (1) document categorization, which aims to classify documents such as tweets into categories, (2) keyword extraction, which aims to identify and extract the most important words of a document, and (3) keyword association modeling, which aims to identify...

A Model for Detecting of Persian Rumors based on the Analysis of Contextual Features in the Content of Social Networks

The rumor is a collective attempt to interpret a vague but attractive situation by using the power of words. Therefore, identifying the rumor language can be helpful in identifying it. The previous research has focused more on the contextual information to reply tweets and less on the content features of the original rumor to address the rumor detection problem. Most of the studies have been in...

Short Text Classification Using Deep Representation: A Case Study of Spanish Tweets in Coset Shared Task

Topic identification as a specific case of text classification is one of the primary steps toward knowledge extraction from the raw textual data. In such tasks, words are dealt with as a set of features. Due to high dimensionality and sparseness of feature vector result from traditional feature selection methods, most of the proposed text classification methods for this purpose lack performance...

Classifier Stacking for Native Language Identification

This paper reports our contribution (team WLZ) to the NLI Shared Task 2017 (essay track). We first extract lexical and syntactic features from the essays, perform feature weighting and selection, and train linear support vector machine (SVM) classifiers each on an individual feature type. The output of base classifiers, as probabilities for each class, are then fed into a multilayer perceptron ...

Time-Sensitive Weighting for Microblog Retrieval

We report our system and experiments for the realtime Adhoc task in the 2011 MicroBlog track. Our goal is to develop effective technique to retrieve relevant tweets that have been posted recently. In particular, we propose a time-sensitive term weighting strategy that can favor tweets in hot-discussed time and a document length related weighting method that can favor long tweets which are mor...

Publication date: 2014